Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition

Authors

Abstract

Multimodal emotion recognition is a challenging task in computing, as it is quite difficult to extract discriminative features that identify the subtle differences among human emotions, which are abstract concepts with multiple expressions. Moreover, how to fully utilize both audio and visual information is still an open problem. In this paper, we propose a novel multimodal fusion attention network for audio-visual emotion recognition based on adaptive and multi-level factorized bilinear pooling (FBP). First, for the audio stream, a fully convolutional network (FCN) equipped with a 1-D attention mechanism and local response normalization is designed for speech emotion recognition. Next, a global FBP (G-FBP) approach is presented to perform fusion by integrating a self-attention based video stream with the proposed audio stream. To improve G-FBP, an adaptive strategy (AG-FBP) that dynamically calculates the fusion weight of the two modalities is devised, based on the emotion-related representation vectors from the respective modalities. Finally, to fully utilize the information of both modalities, an adaptive and multi-level FBP (AM-FBP) is introduced by combining global-trunk and intra-trunk data within one recording on top of AG-FBP. Tested on the IEMOCAP corpus using only the audio stream, the new FCN method outperforms state-of-the-art results with an accuracy of 71.40%. Further validated on the AFEW database of the EmotiW2019 sub-challenge and on IEMOCAP for audio-visual emotion recognition, the proposed AM-FBP achieves the best accuracies of 63.09% and 75.49%, respectively, on the test set.
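To make the fusion step in the abstract more concrete, the following PyTorch-style code is a minimal sketch of factorized bilinear pooling with an adaptive modality weight. All dimensions and the norm-based weighting are assumptions for illustration; the paper derives its AG-FBP weight from the attention-based, emotion-related representation vectors, so this is not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFBP(nn.Module):
    """Illustrative FBP fusion of an audio and a video embedding."""

    def __init__(self, audio_dim=256, video_dim=256, factor_dim=512, k=4, num_classes=4):
        super().__init__()
        # Low-rank factor projections: U for audio, V for video.
        self.U = nn.Linear(audio_dim, factor_dim * k)
        self.V = nn.Linear(video_dim, factor_dim * k)
        self.k = k
        self.classifier = nn.Linear(factor_dim, num_classes)

    def forward(self, a, v):
        # a: (batch, audio_dim) utterance-level audio embedding
        # v: (batch, video_dim) attention-pooled video embedding

        # Adaptive fusion weight; here a simple norm-based stand-in for the
        # attention-derived weighting described in the abstract.
        na, nv = a.norm(dim=1, keepdim=True), v.norm(dim=1, keepdim=True)
        lam = na / (na + nv + 1e-8)
        a, v = lam * a, (1.0 - lam) * v

        # Factorized bilinear pooling: element-wise product of the two
        # projections, sum-pooled over the k factors.
        joint = self.U(a) * self.V(v)                          # (batch, factor_dim * k)
        joint = joint.view(joint.size(0), -1, self.k).sum(2)   # (batch, factor_dim)

        # Power and L2 normalization, as is standard for bilinear pooling.
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)
        joint = F.normalize(joint)
        return self.classifier(joint)

The number of factors k trades off the expressiveness of the bilinear interaction against parameter count and overfitting; the projections keep the fused representation compact compared with a full bilinear outer product.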

Similar Articles

Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering

Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both visual content of images and textual content of questions. To support the VQA task, we need to find good solutions for the following three issues: 1) fine-grained feature representations for both the image and the question; 2) multi-modal feature fusion that is able to capture the complex int...

Semantic Audio-Visual Data Fusion for Automatic Emotion Recognition

The paper describes a novel technique for the recognition of emotions from multimodal data. We focus on the recognition of the six prototypic emotions. The results from the facial expression recognition and from the emotion recognition from speech are combined using a bi-modal multimodal semantic data fusion model that determines the most probable emotion of the subject. Two types of models bas...

Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention

This paper focuses on two key problems for audiovisual emotion recognition in video. One is the temporal alignment of the audio and visual streams for feature-level fusion. The other is locating and re-weighting the perception attentions in the whole audiovisual stream for better recognition. The Long Short Term Memory Recurrent Neural Network (LSTM-RNN) is employed as the main classification ...

Low-level Fusion of Audio and Video Feature for Multi-modal Emotion Recognition

Bimodal emotion recognition through audiovisual feature fusion has been shown superior over each individual modality in the past. Still, synchronization of the two streams is a challenge, as many vision approaches work on a frame basis, opposing the audio turn- or chunk-basis. Therefore, late fusion schemes such as simple logic or voting strategies are commonly used for the overall estimation of under...

Multi-Focus Image Fusion in DCT Domain using Variance and Energy of Laplacian and Correlation Coefficient for Visual Sensor Networks

The purpose of multi-focus image fusion is to gather the essential information and the focused parts from the input multi-focus images into a single image. These multi-focus images are captured with different depths of focus of the cameras. Many multi-focus image fusion techniques have been introduced by considering the focus measurement in the spatial domain. However, the multi-focus image ...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2021

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2021.3096037